Unit 7 High Availability Solutions

Purpose

This unit describes how HP's high-availability solutions can assist the high-end customer in maximizing data and system availability.

Objectives

At the end of this unit, you will be able to:

o Explain why high availability is particularly important to high-end customers.
o List the major high-availability solutions available from HP.
o Describe why HP high-availability solutions are superior to competitive solutions.
o Describe key points in HP's vision for the future of high availability.

Introduction

High availability is a critical aspect of systems management that ensures the consistent and dependable availability of both data and systems. High availability means that access to mission-critical data and applications is maximized by:

o Minimizing unplanned down-time
o Minimizing or eliminating planned down-time

High-availability solutions protect a business's ability to function. This is particularly true for high-end customers because:

o Minutes or hours of down-time may mean thousands of dollars in lost productivity, lost revenue, and increased expenses.
o High-end configurations are larger, so unavailable data and resources affect a larger community of users.
o The significant investment in hardware warrants a highly available environment.
o Operations may be round-the-clock, and any unavailability has a significant effect.

In high-end environments, system down-time cannot be tolerated.

HP Solutions

HP Strategy

HP offers a range of products that provide increasing levels of high availability, enabling customers to tailor the best solutions for their environments. The focus is on minimizing both planned and unplanned down-time.

[Figure: High Availability in the Data Center]

Key Messages

HP continues to provide industry-leading hardware reliability with PA-RISC and VLSI technologies.
Today, HP's Corporate Business Systems build on this foundation by providing built-in high-availability features in the hardware. HP's software is reliable and is subjected to rigorous quality testing. In addition, MPE/iX provides a comprehensive software resiliency strategy that anticipates software failures and takes action before they occur to further improve system up-time.

In addition to high reliability, HP offers solutions to further enhance data integrity and availability, system availability, and fault and disaster tolerance. HP also has solutions to address the various causes of unplanned failures.

[Figure: Causes of Unplanned Failures]

Highly Reliable Hardware and Software

Reliable Hardware

For the HP 3000 and HP 9000

The HP 3000 and HP 9000 Corporate Business Systems incorporate high-availability features in their error-correcting circuitry, memory arrays, I/O channels, and main processor memory bus. Easy deconfiguration of failed CPUs, memory, or I/O interface modules enables Corporate Business Systems to achieve a maximum hardware up-time of 99.97%. Also included are standard features such as automatic system recovery with powerfail (battery backup).

Reliable Software

Both the MPE/iX and HP-UX operating systems have been designed to offer a superior level of software reliability.

System Boot-Up

During the system boot-up process, if an error occurs while mounting user volumes, MPE/iX and HP-UX will identify the problem volumes and continue to mount the others. The systems also provide auto configuration at system startup.

For the HP 3000

MPE/iX is solid even under heavy loading and is a bullet-proof operating environment. For example, MPE/iX Release 4.0 has been protected in 24 additional ways from system failures.

Try/Recover Routines

Try/recover routines, which have always existed in MPE/iX, have been implemented in even more places for even higher resiliency.
Try/recover routines enable MPE/iX to recover from error conditions without failing the system. This ongoing commitment to software excellence makes MPE/iX the most robust commercial operating system in the industry, next to MVS/ESA.

Table Monitor

Another bullet-proof capability comes with the use of a Table Monitor. This feature, available in Q4CY92, monitors table usage and proactively responds to prevent table limits from being exceeded. Combined with new, larger table sizes, it will contribute to making the HP 3000 even more available to users on a daily basis.

Aggregate Parallel Recovery

MPE/iX Release 4.0 incorporates a new feature called Aggregate Parallel Recovery (APR). This enables system and user volumes to be recovered in parallel rather than serially, which speeds up the system boot-up process and increases system up-time.

For the HP 9000

HP-UX is the most mature commercial UNIX operating environment in the industry. Extensive testing is done on all major releases to ensure reliability and resiliency.

Data Integrity and Availability

Both the HP 3000 and HP 9000 offer a variety of data integrity solutions.

For the HP 3000 and HP 9000

RAID

The use of Redundant Arrays of Inexpensive Disks (RAID), or parity disk arrays, provides added measures of data availability and recovery in the event of a disk failure. The use of disk arrays is very common in high-end environments.

[Figure: High Data Availability]

Disk Mirroring

Software disk mirroring further increases the availability of disks beyond disk arrays. A disk array will not remain available if the array controller, interface card, or power supply fails. With mirroring, data is also available from another array with a functioning controller, interface card, or power supply.

SCSI Mirroring

Mirroring of SCSI disk drives is supported on MPE/iX 4.0 and HP-UX 9.0.
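
To illustrate how a parity disk array can recover data after a single disk failure, the following Python sketch may help. It assumes simple byte-wise XOR parity, which is the general technique behind parity arrays; this unit does not state which RAID level or parity scheme HP's disk arrays use, and the function names and sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute the parity block stored on the parity disk (assumed XOR parity)."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity):
    """Reconstruct the one missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical data: one block per data disk.
data = [b"ACCT", b"0042", b"PAID"]
parity = make_parity(data)

# Simulate the loss of the second disk and rebuild its block.
lost = 1
survivors = data[:lost] + data[lost + 1:]
recovered = rebuild(survivors, parity)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. The same property explains why a parity array alone cannot survive the loss of a shared controller or power supply, which is the gap that mirroring covers.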
For the HP 3000

MPE/iX Transaction Manager

MPE/iX features an integrated, built-in transaction manager which automatically and transparently provides data integrity for databases, indexed (keyed) sequential access method (KSAM) files, system and file system directories, and other critical system tables. In the event of a system failure or extended power outage, the contents of memory, databases, files, and critical system tables are automatically and transparently restored to a state of data integrity by the transaction manager.

NetBase Shadow

Disk mirroring and disk sharing across a network are now available with the NetBase Shadow feature, enhancing data availability on the HP 3000. This feature differs from Mirrored Disk/iX in that it can mirror to more than 12 sites, and across a WAN. Any mirrored set of disks can be used for on-line backup purposes.

For the HP 9000

Logical Volume Manager

Because of HP's strong commitment to standards, OSF's Logical Volume Manager (LVM) is supported on HP-UX. LVM mirroring maintains up to three copies of data on separate disks. Disks in a mirrored pair or triplet can be taken off-line for backup while applications continue to access data on-line. LVM also provides the capability for files (maximum 2 Gbytes) to span multiple physical volumes, improving performance and availability.

System Availability

For the HP 3000 and HP 9000

SwitchOver

Response to a failure at the system level should be automatic, quickly returning operations to normal. HP's solutions, such as SPU SwitchOver/iX for MPE/iX and SwitchOver/UX for HP-UX, provide near-continuous operation of mission-critical computing environments. SwitchOver/UX provides for automatic fault detection and recovery of the failed SPU.

HP Support Watch and HP Predictive Support

HP also provides services to minimize unplanned system down-time.
HP Support Watch for HP-UX and HP Predictive Support for MPE/iX monitor the system to proactively detect and report hardware faults to the Response Center and system operator before they cause a failure.

For the HP 3000

AutoRestart/iX

With MPE/iX Release 4.0, AutoRestart/iX (included with HP 3000 CS DX) has been enhanced. Compression is available to reduce the amount of disk space needed for a dump, and the need to dedicate an entire disk for dumps has been eliminated. In addition, a toggle has been added to enable customers to turn the autoboot feature in AutoRestart/iX on or off.

Disaster Tolerance

Natural disasters that interrupt the availability of systems and data can ruin an entire business. High-end data center managers are under pressure to have disaster recovery plans.

For the HP 3000

NetBase

NetBase for MPE/iX offers wide-area disaster recovery. By automatically maintaining copies of data throughout a geographically dispersed network, NetBase Shadowing ensures the availability of the data in the event of a natural disaster. If a system on the network should go down, one command can redirect file access to an alternate computer, bringing an unavailable application back on-line in a very short period of time.

For the HP 3000 and HP 9000

Disaster Recovery

HP also offers Disaster Recovery Services to allow customers to plan for a natural disaster. In the event of a disaster, the HP Backup service provides around-the-clock access to fully operational configurations.

[Figure: US map]

Competition

Reliable Hardware and Software

Although IBM's 3090/ES9000 is perceived to be extremely reliable, HP's systems provide higher availability due to a simpler design with fewer parts (the HP 3000 has won DataPro's reliability rating for years). Water cooling on IBM's 3090, for example, introduces additional points of failure. Since HP systems are air-cooled, these points of failure do not exist.
As software, MVS/ESA provides a high degree of fault resilience to system failure. In the event of a near-fatal fault, software errors are trapped via Functional Recovery Routines and are worked around dynamically. MVS/ESA offers more granularity in the way it shuts down, but applications must be specifically coded to take advantage of it. For example, IBM recommends dedicating one database per application. If the database fails, only that one application is unavailable. However, this approach creates additional programming work and overhead to enable the databases to interact.

MPE/iX Release 4.0 incorporates many new software resiliency features to enhance system up-time. It is the most reliable commercial operating system next to MVS/ESA.

HP-UX is the most mature and reliable UNIX operating environment in the industry.

Data Integrity and Availability

Both HP's MPE/iX and IBM's MVS/ESA provide a high degree of data integrity through facilities such as transaction logging. Transaction logging allows data to be recovered from a log file in the event of a "soft" failure (using rollback recovery) or a "hard" failure (using rollforward recovery).

MPE/iX has a key advantage in that it provides this functionality as an integral part of the operating system. Complete transaction management occurs with all applications transparently. MVS/ESA provides this functionality; however, it is implemented only through its OLTP teleprocessing (TP) monitors, CICS or IMS/DC. OLTP applications are dependent on these additional subsystems for transaction logging. While virtually all MVS/ESA environments utilize these subsystems, having to do so adds to the complexity (and cost) of the system that customers must manage.

MPE/iX provides the most complete, transparent, inherent transaction manager for consistent data integrity. MPE/iX is the most robust commercial operating system in the industry next to MVS.
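
The difference between the rollback and rollforward recovery mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the general logging technique, not a description of how the MPE/iX transaction manager or the MVS/ESA TP monitors are implemented; the log record layout, function names, and sample values are all assumptions made for the example.

```python
def committed_transactions(log):
    """Collect the ids of transactions that reached a commit record."""
    return {rec["txn"] for rec in log if rec["op"] == "commit"}


def roll_back(state, log):
    """Soft failure: undo writes made by transactions that never committed."""
    committed = committed_transactions(log)
    state = dict(state)
    for rec in reversed(log):
        if rec["op"] == "write" and rec["txn"] not in committed:
            if rec["before"] is None:
                state.pop(rec["key"], None)   # key did not exist before the write
            else:
                state[rec["key"]] = rec["before"]
    return state


def roll_forward(backup, log):
    """Hard failure: start from a backup copy and reapply committed writes."""
    committed = committed_transactions(log)
    state = dict(backup)
    for rec in log:
        if rec["op"] == "write" and rec["txn"] in committed:
            state[rec["key"]] = rec["after"]
    return state


# Hypothetical log: transaction 1 committed, transaction 2 was in flight.
log = [
    {"txn": 1, "op": "write", "key": "balance", "before": 100, "after": 150},
    {"txn": 1, "op": "commit"},
    {"txn": 2, "op": "write", "key": "balance", "before": 150, "after": 90},
]

after_soft_failure = roll_back({"balance": 90}, log)      # disk survived the crash
after_hard_failure = roll_forward({"balance": 100}, log)  # disk restored from backup
```

In both cases the data ends in the state left by the last committed transaction. Rollback starts from the surviving data and removes incomplete work, while rollforward starts from an older backup and replays completed work.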
HP and IBM are on a parity level with regard to their disk mirroring and disk array functionality. (In fact, StorageTek uses HP's disk technology in its "Iceberg" intelligent storage subsystems. HP is investigating Iceberg support for a later date.) MPE/iX has an advantage over IBM in being able to mirror disks across a WAN with NetBase.

System Availability

IBM and HP are on a parity level with regard to SPU switchover functionality. HP is also on a parity level with DEC clusters.

HP has an advantage over the AS/400 with regard to system availability for several reasons. The AS/400 has no transaction manager. After a system failure, disks must be reloaded; otherwise the data is not recovered. This can take more than 8 hours! In addition, the AS/400 does not have a switchover product.

Disaster Tolerance

The HP 3000 has an advantage over DEC and IBM with regard to wide-area disaster tolerance. IBM cannot mirror disks over a WAN, so a remote IBM system cannot take over for a failed primary SPU. DEC has the same problem: DEC VAXclusters do not provide the WAN feature required for disaster recovery, because the cluster must reside at one location.

HP's Vision for the Future

HP will incorporate new hardware and software design features to provide even greater availability in the future and further reduce the potential for system failure. These features will also help minimize or eliminate planned down-time for system maintenance.

[Figure: The Drive Towards Continuous Availability]

Planned hardware enhancements include:

o On-line replacement for critical elements, including datacomm links, I/O buses, and I/O interfaces
o On-line CPU and memory replacement for future HP multiprocessing systems
o Hardware resiliency features that prevent component failures from causing a hard crash
o "Memory de-allocation" to allow a memory board to detect correctable errors that usually precede a failure and take itself off-line
o "Graceful degradation" to enable a processor board (in a multiprocessing system) to anticipate its own failure and take itself off-line without bringing the system down

Software enhancements will include "bullet-proofing" all MPE/iX subsystem applications to further enhance software resiliency. Software resiliency features are also planned for future releases of HP-UX to reduce the root causes of system panics.